File type detection algorithm based on principal component analysis and K nearest neighbors
YAN Mengdi, QIN Linlin, WU Gang
Journal of Computer Applications    2016, 36 (11): 3161-3164.   DOI: 10.11772/j.issn.1001-9081.2016.11.3161
Abstract
Identifying a file's type from its suffix or from file-specific features can yield low recognition accuracy. To address this problem, a new content-based file-type detection algorithm was proposed, based on Principal Component Analysis (PCA) and K Nearest Neighbors (KNN). First, PCA was used to reduce the dimension of the sample space. Then the training samples were clustered so that each file type was represented by its cluster centroids. To reduce the error caused by unbalanced training samples, a distance-weighted KNN algorithm was proposed. The experimental results show that, with a large number of training samples, the improved algorithm reduces computational complexity while maintaining high recognition accuracy. Because the algorithm does not depend on file-specific features, it can be applied more widely.
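The distance-weighted voting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `weighted_knn`, the inverse-distance weight `1 / (distance + eps)`, and the use of Euclidean distance are all assumptions, and the PCA and clustering stages are omitted (the centroids are taken as given).

```python
import math
from collections import defaultdict


def weighted_knn(query, centroids, k=3, eps=1e-9):
    """Classify `query` by distance-weighted voting over labelled centroids.

    centroids: list of (label, vector) pairs, for example the cluster
    centres that represent each file type after dimension reduction.
    Assumption: each of the k nearest centroids votes with weight
    1 / (distance + eps), so near centroids count more than far ones.
    """
    # Euclidean distance from the query to every centroid, nearest first
    scored = sorted((math.dist(query, vec), label) for label, vec in centroids)

    votes = defaultdict(float)
    for dist, label in scored[:k]:
        votes[label] += 1.0 / (dist + eps)

    # The label with the largest accumulated weight wins
    return max(votes, key=votes.get)
```

With hypothetical centroids `[("pdf", [0.0, 0.0]), ("jpg", [4.0, 4.0]), ("jpg", [5.0, 5.0])]` and the query `[1.0, 1.0]`, a plain majority vote with `k=3` would favour `"jpg"` simply because that type has more centroids, whereas the weighted vote favours the much closer `"pdf"` centroid. This is the kind of imbalance error that distance weighting is meant to reduce.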
Data deduplication in Web information integration
LIU Xueqiong, WU Gang, DENG Houping
Journal of Computer Applications    2013, 33 (09): 2493-2496.   DOI: 10.11772/j.issn.1001-9081.2013.09.2493
Abstract
Because traditional data deduplication methods have low time efficiency and detection accuracy, a Stepwise Clustering Data Elimination (SCDE) method was presented based on the characteristics of Web information integration. First, the whole record set was divided into subsets using both key-attribute division and the Canopy clustering technique; then the similar records within each subset were accurately eliminated. A fuzzy entity matching strategy based on dynamic weights was proposed to eliminate duplicate records accurately. It reduced the influence of missing attributes on the record similarity calculation, and company names were given special treatment to improve matching accuracy. The results show that the method outperforms traditional algorithms in time efficiency and detection accuracy, with precision improved by 12.6%. The method has been applied in a forestry yellow page system and performs well.
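The two stages named in the abstract, Canopy-style blocking followed by weighted fuzzy matching, can be sketched roughly as follows. This is an illustration under stated assumptions, not the SCDE method itself: the abstract does not define its "dynamic weight", so the sketch assumes it means renormalizing attribute weights over the attributes present in both records. The names `token_jaccard`, `canopies`, and `record_similarity`, the thresholds, and the use of `difflib.SequenceMatcher` are likewise hypothetical choices, and the special treatment of company names is not modelled.

```python
from difflib import SequenceMatcher


def token_jaccard(r1, r2, field="name"):
    """Cheap similarity for blocking: Jaccard overlap of word tokens."""
    a = set(r1.get(field, "").lower().split())
    b = set(r2.get(field, "").lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def canopies(records, cheap_sim, loose, tight):
    """Canopy clustering on record indices, with loose <= tight.

    A record joins a canopy if its similarity to the centre is >= loose;
    records with similarity >= tight are removed from further processing.
    """
    remaining = list(range(len(records)))
    result = []
    while remaining:
        center, rest = remaining[0], remaining[1:]
        sims = {i: cheap_sim(records[center], records[i]) for i in rest}
        result.append([center] + [i for i in rest if sims[i] >= loose])
        remaining = [i for i in rest if sims[i] < tight]
    return result


def record_similarity(r1, r2, weights):
    """Weighted fuzzy similarity over attributes present in both records.

    Assumption: weights are renormalized over the shared attributes, so a
    missing attribute does not drag the score down.
    """
    shared = [f for f in weights if r1.get(f) and r2.get(f)]
    total = sum(weights[f] for f in shared)
    if total == 0:
        return 0.0
    score = sum(
        weights[f] * SequenceMatcher(None, r1[f].lower(), r2[f].lower()).ratio()
        for f in shared
    )
    return score / total
```

In use, `canopies` would first group candidate records cheaply, and `record_similarity` would then be applied only to pairs inside the same canopy, with pairs above a chosen threshold treated as duplicates. Restricting the expensive comparison to within-canopy pairs is what saves time relative to comparing every pair in the full record set.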